Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[observability][export-api] Write task events #47193

Merged
merged 19 commits into from
Sep 5, 2024

Conversation

nikitavemuri
Copy link
Contributor

@nikitavemuri nikitavemuri commented Aug 19, 2024

Why are these changes needed?

  • Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
  • All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well.
    • The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush.
  • Task events converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second).
  • Individual task events will be aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Nikita Vemuri added 7 commits August 16, 2024 16:37
wip
Signed-off-by: Nikita Vemuri <[email protected]>
Signed-off-by: Nikita Vemuri <[email protected]>
Signed-off-by: Nikita Vemuri <[email protected]>
Signed-off-by: Nikita Vemuri <[email protected]>
Signed-off-by: Nikita Vemuri <[email protected]>
Signed-off-by: Nikita Vemuri <[email protected]>
Signed-off-by: Nikita Vemuri <[email protected]>
@@ -180,7 +180,9 @@ EventManager &EventManager::Instance() {
return instance_;
}

bool EventManager::IsEmpty() { return reporter_map_.empty(); }
bool EventManager::IsEmpty() {
return reporter_map_.empty() && export_log_reporter_map_.empty();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followup from #47143

Signed-off-by: Nikita Vemuri <[email protected]>
<< ExportEvent_SourceType_Name(export_event.source_type())
<< ". This indicates a bug in the code, and the event will be dropped.";
}
RAY_CHECK(element != export_log_reporter_map_.end());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followup from #47143

Signed-off-by: Nikita Vemuri <[email protected]>
@nikitavemuri nikitavemuri added the go add ONLY when ready to merge, run all tests label Aug 19, 2024
Signed-off-by: Nikita Vemuri <[email protected]>
@nikitavemuri nikitavemuri marked this pull request as ready for review August 19, 2024 17:08
@nikitavemuri nikitavemuri requested a review from a team as a code owner August 19, 2024 17:08
@rkooo567 rkooo567 self-assigned this Aug 20, 2024
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I am not 100% sure if it is a good idea to aggregate events ourselves because it means we will have indeterministic data in the log depending on when we flush. But I wil just let it go for now (we should definitely talk to customers if it is semantic they want)

/// Convert rpc::TaskLogInfo to rpc::ExportTaskEventData::TaskLogInfo
inline void TaskLogInfoToExport(const rpc::TaskLogInfo &src,
rpc::ExportTaskEventData::TaskLogInfo *dest) {
dest->set_stdout_file(src.stdout_file());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: can we just swap here? Is the caller still using src after it is passed?

dest->Swap(&src)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is called with src being state_update_->task_log_info_.value() so we probably shouldn't change src because it's a TaskStatusEvent class attribute

int64_t timestamp,
rpc::ExportTaskEventData::TaskStateUpdate *state_updates) {
if (task_status == rpc::TaskStatus::NIL) {
// Not status change.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Not status change.
// No status change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Qq; when does it happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like NIL is the default value when a TaskStatusEvent is initialized. @rickyyx Do you know if FillTaskStatusUpdateTime is ever called when the task_status = rpc::TaskStatus::NIL? This logic is mirrored from that function

@@ -211,6 +289,7 @@ bool TaskEventBufferImpl::Enabled() const { return enabled_; }

void TaskEventBufferImpl::GetTaskStatusEventsToSend(
std::vector<std::unique_ptr<TaskEvent>> *status_events_to_send,
std::vector<std::unique_ptr<TaskEvent>> *dropped_status_events_to_write,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of maintaining dropped tasks, can we just maintain a new vector just for export events? And we can just make it 10X original vector. I think maintaining another dropped tasks that are not dropped is not intuitive + it can easily screw up ordering depending on task drop policy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I needed to make the buffers for these events contain a shared_ptr rather than unique_ptr though so the same object can be moved to multiple buffers.

Copy link
Contributor

@rkooo567 rkooo567 Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also can you just run microbenchmark to see if there's any noticeable regression (I think it shouldn't affect but just in case)? lmk if you don't know how to run them!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 20, 2024
@nikitavemuri
Copy link
Contributor Author

Btw, I am not 100% sure if it is a good idea to aggregate events ourselves because it means we will have indeterministic data in the log depending on when we flush.

Makes sense, we can revisit this logic after this initial prototype. The tradeoff here is minimizing the number of events that get written

Nikita Vemuri added 7 commits August 21, 2024 22:28
Signed-off-by: Nikita Vemuri <[email protected]>
Signed-off-by: Nikita Vemuri <[email protected]>
Signed-off-by: Nikita Vemuri <[email protected]>
Signed-off-by: Nikita Vemuri <[email protected]>
Signed-off-by: Nikita Vemuri <[email protected]>
Signed-off-by: Nikita Vemuri <[email protected]>
Signed-off-by: Nikita Vemuri <[email protected]>
@rkooo567
Copy link
Contributor

okay, microbenhcmark result looks good. is this PR ready to review again
?

@nikitavemuri
Copy link
Contributor Author

okay, microbenhcmark result looks good. is this PR ready to review again ?

yes

ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well.
The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush.
Task events converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second).
Individual task events will be aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file.

Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
- Re add code changes from [observability][export-api] Write task events ray-project#47193, which was previous reverted due to CI test linux://:task_event_buffer_test is consistently_failing ray-project#47519, CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523 and CI test darwin://:task_event_buffer_test is consistently_failing ray-project#47525
- Was able to reproduce the failures locally and fixed test in 07efa6f. Failure was due to logical merge conflict (previous PR wasn't re-based off latest master after other event PRs were merged).

Signed-off-by: Nikita Vemuri <[email protected]>
Co-authored-by: Nikita Vemuri <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well.
The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush.
Task events converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second).
Individual task events will be aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file.

Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
- Re add code changes from [observability][export-api] Write task events ray-project#47193, which was previous reverted due to CI test linux://:task_event_buffer_test is consistently_failing ray-project#47519, CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523 and CI test darwin://:task_event_buffer_test is consistently_failing ray-project#47525
- Was able to reproduce the failures locally and fixed test in 07efa6f. Failure was due to logical merge conflict (previous PR wasn't re-based off latest master after other event PRs were merged).

Signed-off-by: Nikita Vemuri <[email protected]>
Co-authored-by: Nikita Vemuri <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well.
The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush.
Task events converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second).
Individual task events will be aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file.

Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
- Re add code changes from [observability][export-api] Write task events ray-project#47193, which was previous reverted due to CI test linux://:task_event_buffer_test is consistently_failing ray-project#47519, CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523 and CI test darwin://:task_event_buffer_test is consistently_failing ray-project#47525
- Was able to reproduce the failures locally and fixed test in 07efa6f. Failure was due to logical merge conflict (previous PR wasn't re-based off latest master after other event PRs were merged).

Signed-off-by: Nikita Vemuri <[email protected]>
Co-authored-by: Nikita Vemuri <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well.
The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush.
Task events converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second).
Individual task events will be aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file.

Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
- Re add code changes from [observability][export-api] Write task events ray-project#47193, which was previous reverted due to CI test linux://:task_event_buffer_test is consistently_failing ray-project#47519, CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523 and CI test darwin://:task_event_buffer_test is consistently_failing ray-project#47525
- Was able to reproduce the failures locally and fixed test in 07efa6f. Failure was due to logical merge conflict (previous PR wasn't re-based off latest master after other event PRs were merged).

Signed-off-by: Nikita Vemuri <[email protected]>
Co-authored-by: Nikita Vemuri <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well.
The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush.
Task events converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second).
Individual task events will be aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file.

Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
- Re add code changes from [observability][export-api] Write task events ray-project#47193, which was previous reverted due to CI test linux://:task_event_buffer_test is consistently_failing ray-project#47519, CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523 and CI test darwin://:task_event_buffer_test is consistently_failing ray-project#47525
- Was able to reproduce the failures locally and fixed test in 07efa6f. Failure was due to logical merge conflict (previous PR wasn't re-based off latest master after other event PRs were merged).

Signed-off-by: Nikita Vemuri <[email protected]>
Co-authored-by: Nikita Vemuri <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well.
The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush.
Task events converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second).
Individual task events will be aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file.

Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
- Re add code changes from [observability][export-api] Write task events ray-project#47193, which was previous reverted due to CI test linux://:task_event_buffer_test is consistently_failing ray-project#47519, CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523 and CI test darwin://:task_event_buffer_test is consistently_failing ray-project#47525
- Was able to reproduce the failures locally and fixed test in 07efa6f. Failure was due to logical merge conflict (previous PR wasn't re-based off latest master after other event PRs were merged).

Signed-off-by: Nikita Vemuri <[email protected]>
Co-authored-by: Nikita Vemuri <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well.
The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush.
Task events converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second).
Individual task events will be aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file.

Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
- Re add code changes from [observability][export-api] Write task events ray-project#47193, which was previous reverted due to CI test linux://:task_event_buffer_test is consistently_failing ray-project#47519, CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523 and CI test darwin://:task_event_buffer_test is consistently_failing ray-project#47525
- Was able to reproduce the failures locally and fixed test in 07efa6f. Failure was due to logical merge conflict (previous PR wasn't re-based off latest master after other event PRs were merged).

Signed-off-by: Nikita Vemuri <[email protected]>
Co-authored-by: Nikita Vemuri <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well.
The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush.
Task events converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second).
Individual task events will be aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file.

Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
- Re add code changes from [observability][export-api] Write task events ray-project#47193, which was previous reverted due to CI test linux://:task_event_buffer_test is consistently_failing ray-project#47519, CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523 and CI test darwin://:task_event_buffer_test is consistently_failing ray-project#47525
- Was able to reproduce the failures locally and fixed test in 07efa6f. Failure was due to logical merge conflict (previous PR wasn't re-based off latest master after other event PRs were merged).

Signed-off-by: Nikita Vemuri <[email protected]>
Co-authored-by: Nikita Vemuri <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well.
The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush.
Task events converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second).
Individual task events will be aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file.

Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
- Re add code changes from [observability][export-api] Write task events ray-project#47193, which was previous reverted due to CI test linux://:task_event_buffer_test is consistently_failing ray-project#47519, CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523 and CI test darwin://:task_event_buffer_test is consistently_failing ray-project#47525
- Was able to reproduce the failures locally and fixed test in 07efa6f. Failure was due to logical merge conflict (previous PR wasn't re-based off latest master after other event PRs were merged).

Signed-off-by: Nikita Vemuri <[email protected]>
Co-authored-by: Nikita Vemuri <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Write task events to file as part of the export API. This logic is only run if RayConfig::instance().enable_export_api_write() is true. Default value is false.
All tasks that are added to the task event buffer will be written to file. In addition, keep a dropped_status_events_for_export_ buffer which stores status events that were dropped from the buffer to send to GCS, and write these dropped events to file as well.
The size of dropped_status_events_for_export_ is 10x larger than task_events_max_num_status_events_buffer_on_worker to prioritize recording data. The tradeoff here is memory on each worker, but this is a relatively small overhead, and it is unlikely the dropped events buffer will fill given the sink for export events (write to file) will succeed on each flush.
Task events converted to the export API proto and written to file in a separate thread, which runs this flush operation periodically (every second).
Individual task events will be aggregated by task attempt before being written. This is consistent with the final event sent to GCS, and also helps reduce the number of events written to file.

Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
- Re add code changes from [observability][export-api] Write task events ray-project#47193, which was previous reverted due to CI test linux://:task_event_buffer_test is consistently_failing ray-project#47519, CI test windows://:task_event_buffer_test is consistently_failing ray-project#47523 and CI test darwin://:task_event_buffer_test is consistently_failing ray-project#47525
- Was able to reproduce the failures locally and fixed test in 07efa6f. Failure was due to logical merge conflict (previous PR wasn't re-based off latest master after other event PRs were merged).

Signed-off-by: Nikita Vemuri <[email protected]>
Co-authored-by: Nikita Vemuri <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants